Red Wine Quality Exploration by Hongyan Wang

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] "3" "4" "5" "6" "7" "8"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 8.40   3: 10  
##  1st Qu.: 9.50   4: 53  
##  Median :10.20   5:681  
##  Mean   :10.42   6:638  
##  3rd Qu.:11.10   7:199  
##  Max.   :14.90   8: 18

Most red wines have quality “5” or “6”. Half of the red wines have a pH between 3.21 and 3.40 (the first and third quartiles), and half have chlorides between 0.07 and 0.09.

About 95% of red wines (1518 of 1599) have quality “5”, “6” or “7”.

The histogram for chlorides without the first 5% and last 5% quantiles.
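The quality shares quoted above can be checked from the counts printed in the summary. This is a minimal sketch with the counts typed in by hand; the vector name `quality_counts` is made up for illustration.

```r
# Counts per quality level, copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each quality level.
quality_share <- quality_counts / sum(quality_counts)
print(round(quality_share, 3))

# Combined share of wines rated 5 or 6, and 5, 6 or 7.
share_5_6 <- sum(quality_share[c("5", "6")])
share_5_6_7 <- sum(quality_share[c("5", "6", "7")])
print(round(c(share_5_6 = share_5_6, share_5_6_7 = share_5_6_7), 3))
```

Roughly 82% of the wines are rated 5 or 6, and about 95% are rated 5, 6 or 7.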

I’m wondering whether chlorides influence the quality of red wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## 
## FALSE  TRUE 
##  1595     4

The minimum of chlorides is 0.012 and the maximum is 0.611, but half of the values lie between 0.07 and 0.09. In particular, most are below 0.45, and there are several outliers.
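The FALSE/TRUE table and the trimmed histogram could be produced along the following lines. The report does not show its code, so this is only a sketch: it assumes the table came from a logical comparison such as `chlorides > 0.45`, and it uses a small invented vector in place of `pf$chlorides`.

```r
# Toy stand-in for pf$chlorides; these values are invented for illustration only.
chlorides <- c(0.012, 0.065, 0.070, 0.073, 0.076, 0.079, 0.080, 0.086,
               0.090, 0.098, 0.120, 0.470)

# Count how many values exceed a cut-off (assumed here to be 0.45).
above_cutoff <- table(chlorides > 0.45)
print(above_cutoff)

# Keep only values between the 5% and 95% quantiles before plotting.
limits <- quantile(chlorides, probs = c(0.05, 0.95))
trimmed <- chlorides[chlorides >= limits[1] & chlorides <= limits[2]]
print(range(trimmed))

# hist(trimmed) would then draw the histogram without the extreme tails.
```

The same pattern would apply to the other variables by swapping in a different column and cut-off.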

The histogram for chlorides without the first 5% and last 5% quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## 
## FALSE  TRUE 
##  1591     8

The minimum of sulphates is 0.33 and the maximum is 2.00, but half of the values lie between 0.55 and 0.73. In particular, most are below 1.5. There are some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

I wonder if the red wine quality has anything to do with the alcohol. Alcohol may have a big influence on wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## 
## FALSE  TRUE 
##  1590     9

Most wines have total.sulfur.dioxide below 150; there are some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## 
## FALSE  TRUE 
##  1588    11

Half of the wines have residual.sugar between 1.9 and 2.6, and most are below 10.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## 
## FALSE  TRUE 
##  1595     4

Half of the wines have volatile.acidity between 0.39 and 0.64; there are several outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
## FALSE  TRUE 
##  1467   132

Some wines contain no citric.acid: the minimum is 0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## 
## FALSE  TRUE 
##  1595     4

Most wines have free.sulfur.dioxide below 60. There are several outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

pH and density look approximately normally distributed.

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 13 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). X is just the row index. The variable quality has levels “3”, “4”, “5”, “6”, “7” and “8”.

Most wines have quality “5”, “6” or “7”. Most wines have free.sulfur.dioxide below 60 and total.sulfur.dioxide below 150. Half of the wines have volatile.acidity between 0.39 and 0.64, residual.sugar between 1.9 and 2.6, sulphates between 0.55 and 0.73, and chlorides between 0.07 and 0.09.

What is/are the main feature(s) of interest in your dataset?

The main feature I’m interested in is “quality”. I wonder which chemical properties influence the quality of red wine, so I will investigate the relationships between quality and the other variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think alcohol may have a big influence on quality. I will also investigate sulphates, chlorides, fixed.acidity, free.sulfur.dioxide, total.sulfur.dioxide, residual.sugar, citric.acid and volatile.acidity. pH and density look approximately normally distributed, and I think they may have little influence on quality.

Did you create any new variables from existing variables in the dataset?

No, since I want to investigate whether the existing chemical properties influence quality.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Density and pH look approximately normally distributed, so I guess they may have little influence on quality. The data are already tidy, so I did not change their form. total.sulfur.dioxide, volatile.acidity and chlorides may contain some outliers.

Bivariate Plots Section

I want to look at scatter plots of quality against the other variables, since scatter plots are one of the best ways to understand a bivariate relationship. Because the variable quality is discrete, I use jitter plots.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$sulphates and as.numeric(pf$quality)
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

It seems that quality increases as sulphates increase up to about 0.9. Beyond 0.9, quality decreases as sulphates increase.
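The test header above shows `as.numeric(pf$quality)`, and quality is stored as a factor. A small sketch of what that conversion does may be useful here; the data frame `toy` and its values are invented for illustration, and only `cor.test` and the `as.numeric` call come from the output.

```r
# Toy data, invented for illustration; the real data frame in the output is pf.
toy <- data.frame(
  sulphates = c(0.45, 0.56, 0.58, 0.65, 0.68, 0.75, 0.80, 0.86),
  quality   = factor(c(3, 5, 5, 6, 5, 7, 6, 8), levels = 3:8)
)

# as.numeric() on a factor returns level codes (1, 2, ...), not the labels.
codes  <- as.numeric(toy$quality)
scores <- as.numeric(as.character(toy$quality))
print(rbind(codes, scores))

# Pearson correlation is unchanged by a constant shift, so both give the same r.
r_codes  <- cor.test(toy$sulphates, codes)$estimate
r_scores <- cor.test(toy$sulphates, scores)$estimate
print(c(r_codes, r_scores))
```

With levels “3” to “8”, the codes are simply the scores minus 2, so the correlations reported in this section are not affected. The shift only matters where the absolute value of quality is used.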

## 
##  Pearson's product-moment correlation
## 
## data:  pf$chlorides and as.numeric(pf$quality)
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

It’s hard to find a pattern for chlorides and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$alcohol and as.numeric(pf$quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Apart from a few points, the more alcohol a wine contains, the higher its quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$free.sulfur.dioxide and as.numeric(pf$quality)
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

free.sulfur.dioxide doesn’t have much influence on wine quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$fixed.acidity and as.numeric(pf$quality)
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

The fixed.acidity doesn’t have much influence on quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$volatile.acidity and as.numeric(pf$quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The more volatile.acidity the wines contain, the lower quality they have.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$total.sulfur.dioxide and as.numeric(pf$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

total.sulfur.dioxide doesn’t have much influence on wine quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$citric.acid and as.numeric(pf$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

citric.acid has only a weak positive correlation with quality (r ≈ 0.23), so it doesn’t have much influence on quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$residual.sugar and as.numeric(pf$quality)
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

The residual.sugar doesn’t have much influence on quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$pH and as.numeric(pf$quality)
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

As expected, the pH doesn’t have much influence on quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$density and as.numeric(pf$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

Also, the density doesn’t have much influence on quality.

It seems that free.sulfur.dioxide and total.sulfur.dioxide have a linear relationship.

Now I want to use box plots to explore the relationships between quality and alcohol, sulphates and volatile.acidity.

From this box plot, we can see that the wines whose qualities are high have high alcohol.

We can see that the wines whose qualities are high have high sulphates.

We can see that the wines whose qualities are high have low volatile.acidity.
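The comparisons read off the box plots above can also be made numerically. The sketch below is not the report's code: it uses a small invented data frame named `toy`, and it relies only on base R's `tapply` and `boxplot`.

```r
# Toy data, invented for illustration; the real data frame in the output is pf.
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7), levels = 5:7),
  alcohol = c(9.4, 9.8, 10.0, 10.2, 10.5, 11.0, 11.2, 11.5, 12.0)
)

# Median alcohol within each quality level (the centre line of each box).
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
print(median_by_quality)

# boxplot() with plot = FALSE returns the five-number summaries it would draw.
box_stats <- boxplot(alcohol ~ quality, data = toy, plot = FALSE)
print(box_stats$stats)
```

The third row of `box_stats$stats` holds the medians, so a steady rise or fall across quality levels there corresponds to the trend visible in the plot.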

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I find that the features chlorides, fixed.acidity, residual.sugar, pH, free.sulfur.dioxide, citric.acid, total.sulfur.dioxide and density don’t have much influence on the quality of wines.

The quality of red wines is related to alcohol, sulphates and volatile.acidity.

The more alcohol the wines contain, the higher quality they have. The more volatile.acidity they contain, the lower quality they have. For sulphates, it looks like the quality of wines increases as sulphates increase when sulphates < 0.9, and then decreases as sulphates increase further.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

total.sulfur.dioxide and free.sulfur.dioxide are linearly related. The more free.sulfur.dioxide the wines contain, the more total.sulfur.dioxide they contain.

Also, it seems that the more citric.acid the wines contain, the less volatile.acidity they contain.

What was the strongest relationship you found?

The strongest relationship is between quality and alcohol, which are positively correlated (r ≈ 0.48). Quality and sulphates are also correlated (r ≈ 0.25).

Multivariate Plots Section

Look at the summary for variable sulphates first.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Then I break sulphates into two buckets (0.33,0.7] and (0.7,2]

I plot the scatter plot for alcohol and quality, colored by sulphates.

Look at the summary of alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Break alcohol into 3 buckets: (8.4,10.2], (10.2,10.42] and (10.42,14.90].
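Bucketing like this is typically done with `cut()`. The report does not show the call, so the following is a sketch under that assumption; the break points come from the alcohol summary above, while the short `alcohol` vector is just a set of example values.

```r
# Alcohol break points taken from the summary above: min, median, mean, max.
alcohol_breaks <- c(8.4, 10.2, 10.42, 14.9)

# A few example alcohol values, including the minimum and maximum.
alcohol <- c(8.4, 9.5, 10.2, 10.3, 10.42, 11.1, 14.9)

# include.lowest = TRUE keeps the minimum (8.4) inside the first bucket;
# without it, cut() would return NA for that value.
alcohol_bucket <- cut(alcohol, breaks = alcohol_breaks, include.lowest = TRUE)
print(table(alcohol_bucket))
```

One thing to watch: with the default half-open intervals, a wine at exactly the minimum falls outside the first bucket unless `include.lowest = TRUE` is set. The same applies to the sulphates buckets starting at 0.33.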

I plot the relationship between sulphates and quality, colored by alcohol

Plot the scatter plot for volatile.acidity and quality, colored by alcohol

Plot the scatter plot for volatile.acidity and quality, colored by sulphates

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I find that the more alcohol the wines contain, the higher quality they have. When I plot the relationship between alcohol and wine quality colored by other variables such as sulphates, the pattern still holds.

Similarly, I find that the more volatile.acidity the wines contain, the lower quality they have. When I plot the relationship between volatile.acidity and wine quality colored by other variables such as alcohol and sulphates, the pattern still holds.

Were there any interesting or surprising interactions between features?

The relationship between quality and alcohol is a little surprising to me. Before working with the data, I expected both low-alcohol and high-alcohol wines to include low-quality and high-quality examples. Instead, I find that for red wines, the more alcohol they contain, the higher quality they tend to have.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity, 
##     data = pf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7186 -0.3820 -0.0641  0.4746  2.1807 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.61083    0.19569   3.121  0.00183 ** 
## alcohol           0.30922    0.01580  19.566  < 2e-16 ***
## sulphates         0.67903    0.10080   6.737 2.26e-11 ***
## volatile.acidity -1.22140    0.09701 -12.591  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6587 on 1595 degrees of freedom
## Multiple R-squared:  0.3359, Adjusted R-squared:  0.3346 
## F-statistic: 268.9 on 3 and 1595 DF,  p-value: < 2.2e-16

The R² for this linear model is about 0.34, so the model explains only about a third of the variance in quality; it is acceptable, but not very good. Still, the coefficients are informative: the coefficient for alcohol is 0.31, the coefficient for sulphates is 0.68 and the coefficient for volatile.acidity is -1.22.
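The fitted equation can be sanity-checked against the summary statistics printed earlier. The sketch below uses only numbers copied from the output. Its conclusion about the scale of the response is an inference from those numbers, not something the report states.

```r
# Coefficients copied from the lm() summary above.
coefs <- c(intercept = 0.61083, alcohol = 0.30922,
           sulphates = 0.67903, volatile.acidity = -1.22140)

# Sample means of the three predictors, copied from the summary() output.
means <- c(alcohol = 10.42, sulphates = 0.6581, volatile.acidity = 0.5278)

# An OLS fit with an intercept passes through the point of means, so this
# prediction should equal the mean of the response the model was fitted on.
pred_at_means <- coefs[["intercept"]] + sum(coefs[names(means)] * means)
print(round(pred_at_means, 2))

# Mean quality score (3 to 8) computed from the counts in the summary.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
mean_score <- sum(as.numeric(names(counts)) * counts) / sum(counts)
print(round(mean_score, 2))

# Gap between the two; a value near 2 suggests the response was the
# factor's level codes (1 to 6) rather than the scores themselves.
print(round(mean_score - pred_at_means, 2))
```

The prediction at the means is about 3.64, while the mean quality score is about 5.64. If that reading is right, the model was fitted on level codes, and adding 2 to the intercept gives the equation on the 3 to 8 scale; the slopes are unaffected.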


Final Plots and Summary

Plot One

Description One

I choose this plot because I’m investigating the relationships between quality and the chemical properties, and it is useful to know the distribution of quality first. The quality of red wines takes discrete values: “3”, “4”, “5”, “6”, “7” and “8”. Most red wines have quality “5” or “6”. These two facts make it harder to find which chemical properties influence the quality of red wines.

Plot Two

Description Two

I choose this plot because this box plot clearly shows that red wine quality is related to alcohol. We can see that the more alcohol the wines contain, the higher quality they have. For wines of quality 3, 4 or 5, the mean of alcohol is about 10; for quality 6, it is about 10.5; for quality 7, about 11.5; and for quality 8, about 12.1. From the linear model, we know the coefficient for alcohol is 0.30922, which is consistent with the plot.

Plot Three

Description Three

I choose this plot because it clearly shows that red wine quality is negatively related to volatile.acidity. From the box plot for quality and volatile.acidity, we can see that the more volatile.acidity the wines contain, the lower quality they have. From the linear model, we know the coefficient for volatile.acidity is -1.22140, which is consistent with the plot.


Reflection

The red wines dataset contains 1599 observations and 13 variables. First, I started by understanding each variable in the dataset. Since I want to investigate which chemical properties influence the quality of red wines, I looked at the quality variable first. I noticed that quality takes discrete values and that most red wines have quality “5” or “6”. This makes it difficult to find a clear relationship between quality and the other variables, since the other variables are continuous.

Then I investigated the relationship between quality and the other 11 chemical variables one by one. For some variables, such as pH and density, I expected that they would not influence the quality of wines, and the results also show that they don’t have much influence on quality. For other variables, such as fixed.acidity, I expected that they would influence the quality of wines, but after some investigation I didn’t find any clear relationship. It seems that the values of these variables don’t influence the quality of wines. After investigating all the variables, I find that alcohol has much influence on quality, which is a little surprising to me: the more alcohol the red wines contain, the higher quality they have.

Main struggles:

  1. The qualities for most wines are 5 or 6, which makes it very difficult to find a clear relationship between quality and the chemical properties.
  2. The scatter plots, which I think are a very good method for finding bivariate relationships, look very messy. It’s hard to find the patterns in them.
  3. When I explore the multivariate relationships, I plot the scatter plot for two variables, colored by a third variable. This makes the plots even harder to read.

Main successes:

  1. I investigate the variables one by one, and find that some have little to do with quality while others are clearly correlated with it.
  2. I find some variables may be related to each other (like total.sulfur.dioxide and free.sulfur.dioxide). So if I want to build a linear model, I will only use one of them.
  3. I use box plots to find clear relationships between some variables (like alcohol) and quality. This tells me which variables I should use to build my model.
  4. The linear model I build is not perfect, but it still gives me a good sense of the relationships between quality and the other variables.

Future work:

Since the quality of wines is discrete (“3”, “4”, “5”, “6”, “7”, “8”), I think it’s a good idea to use a classification algorithm to explore which chemical properties influence the quality of wines. I could also use such classifier models to predict the quality of wines.